Attention: The data have been uploaded into our github project reposotory, together with some intermediate results (some model fitting results and plots that are time-consuming). Some of the intermediate results are a little bit large (around 200 MB). The aim is to accelerate the knit speed of this .rmd file. You should be able to finish the knit process within 5 minutes after downloading all files in this branch of repository. The website construction files are in the other branch “gh-pages”.

The project website: https://churchillcui.github.io/ads_retail/

The project screencast: https://www.youtube.com/watch?v=__oHjvxeCI4

Overview

Background and Motivation

As the world’s largest economy by nominal GDP and net wealth, the United States has been famous for its prosperous retail industry over decades. Many retailers in the United States have earned their reputations all over the world, such as Walmart, Target, etc. Among all the job positions in the large retailers, sales managers play a key role in the company, since such a competitive industry has been putting consecutively high expectations for them to adjust for rapid social innovation and keep the company’s retail sales moving in the right direction. As sales managers, they are likely to be interested in not only getting prepared for the future sales, but also balancing the online and offline sales, as e-commerce concept is getting more and more popular in recent years. Many questions are arising from them every day: What are the factors that affect sales? Can we predict future sales? How do nationwide economic events influence the whole retail industry? What’s new in this e-commerce era?

These hard problems from retail sales managers call for a more systematic and scientific approach to model different sales types with potential influential factors. Unfortunately, there has long been a gap between the need for sales managers and retail sales data. As discussed in this article talking about retail trends and predictions for 2018, data will (and should) drive retail decisions more than ever. To cover such a gap, we conducted this data science project using retail sales data, trying to provide comprehensive advice for retail sales managers.

Objective

Through this project, we sought to answer four primary questions for retail sales managers, specified as the target audience of our project:

  • What are the factors that affect the offline sales?

  • How to predict future offline sales?

  • How do nationwide economic events affect the whole retail industry?

  • What are the different patterns of online sales compared with offline sales?

Approach

The questions we sought to answer suggest the analysis being separated into two subprojects: one (first two questions) focus on Walmart retail sales and related features, and the other one (the last two questions) focus on the general retail sales pattern in the United States. These two subprojects together help not only answering questions with respect to specific features that retail sales managers may be interested in (for example, would reducing temperature increase retail sales?), but also describing some general patterns of online and offline sales, serving as guidance for all retail sales managers in our arrived e-commerce world.

In the first subproject, we started by exploring feature correlations and temporal distributions of retail sales individually, then exploring their joint relationship. Some advanced statistical methods are then applied, including linear model with stepwise regression, generalized estimating equations, and generalized linear models, to extract key factors that have significant influences on sales outcome. The features collected, together with statistical knowledge of tree-based methods, informed the creation of three advanced tree-based predictive algorithms. We then compared the prediction accuracy and time usage between these tree-based models and traditional statistical baseline model (GLM in our case) to provide some summarized suggestions for sales managers.

In the second subproject, we modeled the temporal trend differences between online and offline sales using splines, and quantified the effect of financial crisis in 2008 by testing the significance of spline coefficients. Their respective seasonal patterns were further analyzed through advanced statistical techniques, such as F test.

Data

Multiple data sources have been extracted and summarized as spreadsheets and organized in our project. Considering the target audience as retail sales managers, we aimed at exploring and studying possible retail sales data with related factors from different sources.

In preparation for the first subproject, we sought to find open sales data from large retailers such as Walmart. Our first attempt was to obtain sales data from the Walmart website. Some summarized financial information are available on the website, such as the 13-week period 4-5-4 comparable store sales of each fiscal year. However, the detailed sales data seems unavailable for public use. Therefore, we turned to look for some other sources on the Internet. We found Walmart’s net sales worldwide from 2006 to 2019 on the Statista. However, the scale was too vague (worldwide overall sales) to perform detailed analysis. Fortunately, we found that Walmart shared a dataset in 2013 through kaggle, which contains the actual historical sales data for 45 selected Walmart stores in the United Stores. Although this dataset only contains limited information about Walmart sales and is not complete, it helped as a good place to start, covering information about both sales and many store features.

To have a close look at the dataset, five .csv files are provided in total, named as “features.csv”, “sampleSubmission.csv”, “stores.csv”, “test.csv” and “train.csv” respectively. As the goal of our first subproject is not to finish the competition on kaggle, but to have a comprehensive analysis of retail sales, we focus on three data frames of them: “features.csv”, “stores.csv”, and “train.csv”. The rest two files are only for competition submission purpose.

“features.csv” contains some temporal features of each store with 8191 rows and 12 columns. The features include the store number (Store), the week (Date), average temperature in the region (Temperature), cost of fuel in the region (Fuel_Price), anonymized markdown data (MarkDown1-5), Consumer Price Index (CPI), the unemployment rate (Unemployment), and whether the week is a special event week (IsHoliday). The special events refer to Super Bowl, Labor Day, Thanksgiving and Christmas every year. “stores.csv” contains non-temporal features of stores with 45 rows and 3 columns, including the number (Store), type (Type) and size (Size) of the store. “train.csv” contains the historical weekly sales of each department at each store with 421570 rows and 5 columns, ranging from February 5, 2010, to November 1, 2012.

Although the data provide historical sales data of Walmart together with some temporal and non-temporal features, we discover some limitations of this dataset. Several other external sources are pursued to handle some of them, illustrated in the next paragraph, while some remained unsolvable in our analysis. For example, the location of each store is anonymized, which hampers the adjustment of retail sales based on local sociodemographic factors such as the population. Meanwhile, the anonymized markdown data making itself hard to use in the analysis, and we also observe a large amount of missing data in this markdown feature in the exploration. As a result, we decide not to use these markdown data in our project, as no inference could be drawn with only some vague explanations from data provider.

To study the characteristics of online and offline retail sales, we obtained the retail sales data in the United States from 4th quarter 1999 to 3rd quarter 2019, originally provided by the US Census Bureau. Estimates are based on data from the Monthly Retail Trade Survey and administrative records. In this dataset, E-commerce sales are sales of goods and services where an order is placed by the buyer or price and terms of sales are negotiated over an online system. We assumed the E-commerce retail sales in this dataset can represent the online sales, and the non-E-commerce (the difference of total retail sales and E-commerce retail sales) can represent the offline sales.

To consider other potential factors, we referred to some articles online and found this article discussed five important variables that affect retail business success. Apart from the competition and technology underlined with online shopping, the general economic status was mentioned as another factor that related to retail sales. Based on this suggestion, we also include the GDP data of the United States in the model. As the politics and consumer trend factors are hard to infer from current data, we leave them for future research extension.

With multiple data sources, we managed to combine and re-organize them into analyzable form via standard tidying and wrangling process.

Analysis

Walmart retail sales inference and prediction

Data wrangling and tidying

We first loaded some R packages used in the project.

library(tidyverse)
library(splines)
library(caret)
library(readxl)
library(lubridate)
library(doMC)
library(xgboost)
library(randomForest)
library(neuralnet)
library(gee)
library(MASS)
library(scales)
library(ggpubr)
library(reshape2)
library(stringr)
library(readr)
library(ggplot2)
library(gridExtra)

Next, we read the .csv files discussed in the previous section.

## weekly sales of each department of each store
train <- read_csv("./Data/train.csv")

## temporal features of each store
features <- read_csv("./Data/features.csv")

## type and size information of each store
stores <- read_csv("./Data/stores.csv")

Although the read_csv function from the readr package will automatically convert .csv files into tibbles, some default convert format are not appropriate. Thus, we modified some columns to the desired format.

## weekly sales data
train$Store <- as.factor(train$Store)
train$Dept <- as.factor(train$Dept)

## temporal feature data
features$Store <- as.factor(features$Store)

## store information data
stores$Store <- as.factor(stores$Store)
stores$Type <- as.factor(stores$Type)

As mentioned in the data section, we used some external sources to build the data frame for the formal analysis, though the data in our second subproject are analyzed relatively independently with different time intervals. Note that those data have already been extracted or re-organized into .csv or .xlsx file before loading to R.

## read GDP data
GDP <- read_csv("./Data/GDP.csv")

The GDP is considered as one of the important variables that affects retail sales. Temporal indicators, including year and quarter, are also added.

## economy status data
GDP$Year <- c(rep(2010, 4), rep(2011, 4), rep(2012, 4), rep(2013,3))
GDP$Quarter <- c(c(1,2,3,4), c(1,2,3,4), c(1,2,3,4), c(1,2,3))

After finishing cleaning each separate data, we next combine them into one data frame, recording the features and sales of each department of each store at each time.

## combine and re-organize data from multiple sources
### combine weekly sales and store information
data_combined <- left_join(train, stores, key = "Store")

### combine temporal features of store
data_combined <- unite(data_combined, "Store-Date", c("Store", "Date"), sep = "-", remove = FALSE)
feature_temp <- unite(features, "Store-Date", c("Store", "Date"), sep = "-", remove = FALSE)
data_combined <- left_join(data_combined, feature_temp, key = "Store-Date")

### combine economy status information
data_combined$Year <- year(data_combined$Date)
data_combined$Quarter <- quarter(data_combined$Date)
GDP$DATE <- NULL
data_combined <- left_join(data_combined, GDP, by = c("Year", "Quarter"))

Since the temporal trend is a very important factor that may affect offline retail sales, we next include some more time indicators with different scales. Some of them (Year, Quarter) have been added in the previous tidying process. As some of them are highly correlated, we later performed forward and backward selections in the model phase, described in detail in the inference section.

## include some temporal variables with different scales
data_combined$Month <- month(data_combined$Date)
data_combined$Day <- day(data_combined$Date)
### set Jan 1, 2010 as the baseline, we include the days of each record to baseline, named as Day_To_Begin
data_combined$Day_To_Begin <- as.numeric(data_combined$Date - as.Date("2010-01-01"))

After combining datasets from multiple resources to one data frame, we perform some further clean work on this data frame by removing columns that could not be well interpreted due to data limitation (e.g. anonymized markdown data) or that used only for data integration purpose (e.g. “Store-Date”). The cleaned data frame is then renamed as data_analysis and is leveraged along the rest of our project.

Note that we do not include online sales data in the inference and prediction of the first subproject, but instead leave them for the second subproject. The reason is that, it is not the total sales that influence retail sales, but the retail sales in Walmart is a component of total sales. As a result, the unknown population information due to location anonymization will lead to great bias.

data_analysis <- data_combined[,c(2:10, 16:23)]

After our data tidying and wrangling process, the integrated data frame data_analysis has 421570 rows and 17 columns.

Exploratory data analysis

In the exploratory data analysis part, we are going to explore the temporal trend of different covariates and possible interactions between these covariates using tidied data.

Heatmap of correlation between covariates

The potential covariates that may influence the weekly sales are time information, CPI, GDP, temperature, unemployment rate, fuel price, type and size of the store. First we are going to look at the correlation heatmap of these covariates to see if there is any dependence among these covariates.

# select potential covariates 
data.heatmap <- data_analysis[,c("Temperature", "CPI", "GDP", "Unemployment", "Fuel_Price", "Year" ,"Month", "Day_To_Begin","Quarter")]
# change day difference format
data.heatmap$Day_To_Begin <- as.numeric(data.heatmap$Day_To_Begin)
# get corrlation matrix
cormat <- round(cor(data.heatmap), 2)
reorder_cormat <- function(cormat){
# use correlation between variables as distance
    dd <- as.dist((1-cormat)/2)
    hc <- hclust(dd)
    cormat <- cormat[hc$order, hc$order]
}
get_upper_tri <- function(cormat){
# get upper triangle matrix
    cormat[lower.tri(cormat)]<- NA
    return(cormat)
}
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
melted_cormat <- melt(upper_tri, na.rm=T)
# plot the correlation
ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
    name="Pearson\nCorrelation") +
  theme_minimal() + # minimal theme
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1)) +
 coord_fixed() +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))

From the heatmap, we observed that there is almost no dependence among these covariates, which is good for the use of our linear model and generalized linear model.

Weekly sales

First we are going to look at average sales of all 45 stores. We summarise the data by calculating average weekly sales in these three years and plot the temporal trend.

# summarise average sales
dat.avg.store <- data_analysis %>%
  group_by(Date) %>%
  summarise(sales=mean(Weekly_Sales))

# time - average sales plot
ggplot(data=dat.avg.store, aes(x=Date, y=sales))+
  geom_line()+
  labs(x="Time",y="Average sales of all stores",title="Average sales of all stores temporal trend")+
  scale_x_date(date_breaks = "3 month",labels = date_format("%Y-%m"))

As we can see, the average sales have some small fluctuations in the first ten months of each year and have a outburst at the end of each year. This is possibly because of Thanksgiving and Christmas holidays.

Then we are going to look at averages sales in each store separately. We summarise the data by calculating the average sales across different department in each store in these three years. Then we are going to look at the relationship between average sales and different covariates, like temperature, CPI, GDP, Unemployment rate, fuel price. We stratify the store by type to visualize the relationship since type is also a potential covariate.

# summarise average sales in each store
dat.ave <- data_analysis %>%
  group_by(Store, Date, Type, IsHoliday, Fuel_Price, CPI, GDP, Unemployment, Temperature) %>%
  summarise(avesales = mean(Weekly_Sales))
# time - average sales plot stratified by type
p1 <- ggplot(data = dat.ave, aes(x = Date, y = avesales, color = Type)) + 
  geom_point(size = .8, alpha = 0.5) + 
  geom_smooth() +
  labs(x = "Time", y = "Average sales", title = "Average sales temporal trend")

# fuel price - average sales plot stratified by type
p2 <- ggplot(data = dat.ave, aes(x = Fuel_Price, y = avesales, color = Type)) + 
  geom_point(size = .8, alpha = 0.5) +
  geom_smooth() +
  labs(x = "Fuel Price", y = "Average sales", title = "Fuel Price-Average sales")

# temperature - average sales plot stratified by type
p3 <- ggplot(data = dat.ave, aes(x = Temperature, y = avesales, color = Type)) + 
  geom_point(size = .8, alpha = 0.5) +
  geom_smooth() +
  labs(x = "Temperature", y = "Average sales", title = "Temperature-Average sales")

# GDP - average sales plot stratified by type
p4 <- ggplot(data = dat.ave, aes(x = GDP, y = avesales, color = Type)) + 
  geom_jitter(position = position_jitter(0.2)) +
  geom_smooth() +
  labs(x = "GDP", y = "Average sales", title = "GDP-Average sales")

# CPI - average sales plot stratified by type
p5 <- ggplot(data = dat.ave, aes(x = CPI, y = avesales, color = Type))+
  geom_jitter(position = position_jitter(0.2)) +
  geom_smooth() +
  labs(x = "CPI", y = "Average sales", title = "CPI-Average sales")

# unemployment rate - average sales plot stratified by type
p6 <- ggplot(data = dat.ave, aes(x = Unemployment, y = avesales, color = Type)) + 
  geom_jitter(position = position_jitter(0.2)) + 
  geom_smooth() +
  labs(x = "Unemployment rate", y = "Average sales", title = "Unemployment rate-Average sales")

ggarrange(p1, p2, p3, p4, p5, p6, nrow = 3, ncol = 2)

From these plots, we can observe that type has interactions with these covariates (time, temperature, CPI, GDP, fuel price, unemployment rate). Therefore it is reasonable to include these interactions in our model. However, some interactions may not be linear.

# type - average sales plot stratified by isholiday indicator
ggplot(data = dat.ave)+ 
  geom_boxplot(aes(x = Type, y = log(avesales), color = IsHoliday)) +
  labs(x = "Type", y = "Average sales", title = "Boxplot of type-sales")

From this plot we can note that holiday effect exists in Type A and Type B stores. The difference is not very significant in Type C stores.

Model fitting and inference

Linear model and stepwise regression

First, we will construct a linear model to capture the features in the dataset. There are many covariates, such as store ID, type and size of the store, department, time information, holiday indicator, CPI, GDP, unemployment rate, fuel price, and temperature. To capture the time information, we first break the date into the year, month information separately.

# Q-Q plot for response
data.fit <- data_analysis
# make response positive
data.fit$Weekly_Sales <- data.fit$Weekly_Sales + abs(min(data.fit$Weekly_Sales)) + 2
data.fit$Month <- factor(data.fit$Month)

First, we adjust the response to all be positive. Then we take a logarithm transformation to downsize the sales.

# Q - Q plot for sales
qqnorm(log(data.fit$Weekly_Sales))

From the Q-Q (Quantile-Quantile) plot, we can see that the logarithm of adjusted weekly sales are not normally (Gaussian) distributed. Therefore, we decide to make adjustments on the dataset before we get started to use multiple statistical models.

We use box cox transformation to rewrite the data, making the normality more significant.

The covariates in the model include store ID, department, holiday indicator, type and size of the store, year, month, fuel price, temperature, CPI, unemployment rate, GDP, and the interaction term between type and time, fuel price, temperature, CPI, unemployment rate, GDP.

# box cox transformation
boxcox <- boxcox(data = data.fit,log(Weekly_Sales) ~ Store + Dept + IsHoliday + Type + Size 
                 + Temperature + Year + Month + Fuel_Price 
                 + CPI + Unemployment + GDP 
      + Type:Temperature + Type:Fuel_Price + Type:CPI + Type:Unemployment + Type: GDP
      + Type:Year + Type:Month + Type:Day_To_Begin + Type:Quarter)
# select best parameter to do box cox transformation
lambda <- boxcox$x # lambda values
lik <- boxcox$y # log likelihood values for SSE
bc <- cbind(lambda, lik) # combine lambda and lik
sorted_bc <- bc[order(-lik),] # values are sorted to identify the lambda value for the maximum log likelihood for obtaining minimum SSE
head(sorted_bc, n = 10)
##     X      lambda      lik
## 1   1 -0.10101010 -1377226
## 2   2 -0.06060606 -1377230
## 3   3 -0.14141414 -1377268
## 4   4 -0.02020202 -1377274
## 5   5  0.02020202 -1377356
## 6   6 -0.18181818 -1377362
## 7   7  0.06060606 -1377470
## 8   8 -0.22222222 -1377515
## 9   9  0.10101010 -1377613
## 10 10 -0.26262626 -1377733

From the result, we can see that the parameter in the box cox transformation should be -0.10101010. Therefore, we rewrite the response using box cox transformation and fit the linear regression model. After doing this, we use stepwise regression (both backward and forward selection) to pick the best model (minimizing AIC).

# transform response variable
data.fit$Weekly_Sales <- (log(data.fit$Weekly_Sales)^(-0.10101010) - 1)/(-0.10101010)
# full model
full.model <- lm(data = data.fit, Weekly_Sales ~ Store + Dept + IsHoliday + Type + Size 
                 + Temperature + Year + Month + Fuel_Price 
                 + CPI + Unemployment + GDP 
      + Type:Temperature + Type:Fuel_Price + Type:CPI + Type:Unemployment + Type: GDP
      + Type:Year + Type:Month + Type:Day_To_Begin + Type:Quarter)
# stepwise regression
step.model <- stepAIC(full.model, direction = "both", trace = FALSE)

However, when we check the model and plot the Q-Q plot for the residuals, we find an outlier in this dataset.

plot(step.model)

Therefore we delete the outlier in the dataset, and use selected covariates to refit the linear model as below.

# use the selected model and delete the outlier
final.model <- lm(data=data.fit[-267731,], Weekly_Sales ~ Store + Dept + IsHoliday + Temperature + Year + Month + Fuel_Price + CPI + Unemployment
      + Type:CPI + Type:Unemployment
      + Type:Year + Type:Month + Type:Day_To_Begin)

Again, we checked the residual plot and found that the residuals are not following the normality assumption. Therefore, the linear model does not capture the data features well, and we would need to turn to other models.

plot(final.model)

Generalized Estimating Equations

First, we calculate the summation of sales across different departments within a store on a given day. Then we treat each stores’ sales data as longitudinal observations and use generalized estimating equations to fit the model. The covariates are those we selected in the linear model, and the working correlation structure is set to be corstr = "exchangeable" since we assume sales is correlated between different days.

# get summarised data for gee
model_gee <- data_analysis %>% group_by(Store, Year, Month, GDP, Day, Quarter, Day_To_Begin, IsHoliday, Type, Temperature, Fuel_Price, CPI, Unemployment) %>% summarise(sales = sum(Weekly_Sales))
model_gee$Month <- factor(model_gee$Month)

gee.model1 <- gee(log(sales) ~ Store + IsHoliday + Temperature + Year + Month + Fuel_Price + CPI + Unemployment
      + Type:CPI + Type:Unemployment
      + Type:Year + Type:Month + Type:Day_To_Begin,
      id = factor(Store), data = model_gee, corstr = "exchangeable")
summary(gee.model1)$coefficients
## Warning in sqrt(diag(object$robust.variance)): NaNs produced
##                         Estimate   Naive S.E.       Naive z  Robust S.E.
## (Intercept)        -1.422416e+03 9.733313e+02  -1.461389763 1.882626e+02
## Store2              2.151274e-01 9.777424e-03  22.002459281 2.072335e-02
## Store3              1.005331e+03 1.466478e+03   0.685541120 4.809605e+02
## Store4              2.858392e-01 8.478455e-01   0.337135987 4.573619e+00
## Store5              1.005126e+03 1.466478e+03   0.685401470 5.292827e+02
## Store6             -1.099294e-02 3.363655e-02  -0.326815200 1.668451e-01
## Store7              1.006039e+03 1.466482e+03   0.686021927 5.181821e+02
## Store8             -5.544416e-01 5.673761e-02  -9.772029749 2.803415e-01
## Store9              1.005621e+03 1.466477e+03   0.685739232 5.898534e+02
## Store10             1.008040e+03 1.466487e+03   0.687384120 3.558344e+02
## Store11            -1.485769e-01 3.803652e-02  -3.906164299 1.881146e-01
## Store12             1.007478e+03 1.466491e+03   0.686999181 4.846340e+02
## Store13             2.644259e-01 8.499528e-01   0.311106557 4.561276e+00
## Store14             2.845381e-01 2.937709e-01   0.968571327 1.551740e+00
## Store15             1.006853e+03 1.466486e+03   0.686574943 5.113611e+02
## Store16             1.005920e+03 1.466480e+03   0.685941664 5.268946e+02
## Store17             1.007277e+03 1.466485e+03   0.686864804 5.008800e+02
## Store18             1.007415e+03 1.466487e+03   0.686957881 4.520135e+02
## Store19            -4.504968e-02 7.889000e-01  -0.057104421 4.215735e+00
## Store20             3.132840e-01 7.063122e-02   4.435488708 3.639402e-01
## Store21             1.006011e+03 1.466479e+03   0.686004349 4.182536e+02
## Store22             1.007304e+03 1.466486e+03   0.686882874 5.083839e+02
## Store23             1.007618e+03 1.466483e+03   0.687097909 4.072162e+02
## Store24            -1.026472e-01 7.903253e-01  -0.129879668 4.215988e+00
## Store25             1.006040e+03 1.466479e+03   0.686024088 2.770971e+02
## Store26            -4.035751e-01 7.898698e-01  -0.510938713 4.217533e+00
## Store27             1.557310e-01 7.504298e-01   0.207522370 4.012164e+00
## Store28            -7.983258e-02 8.774122e-01  -0.090986397 4.604740e+00
## Store29             1.006733e+03 1.466488e+03   0.686492392 4.261502e+02
## Store30            -5.762195e+02 2.098184e+03  -0.274627751 1.174169e+03
## Store31            -1.075787e-01 9.807124e-03 -10.969440135 2.033850e-02
## Store32            -2.589477e-01 2.238025e-01  -1.157037031 1.172086e+00
## Store33            -1.772266e+00 8.524693e-01  -2.078979856 4.556073e+00
## Store34            -4.284967e-01 8.607111e-01  -0.497840416 4.566320e+00
## Store35             1.007197e+03 1.466486e+03   0.686809892 5.049853e+02
## Store36            -1.434084e+00 1.798512e-02 -79.737223367 7.649036e-02
## Store37            -5.760377e+02 2.098184e+03  -0.274541023 9.226550e+02
## Store38            -5.772954e+02 2.098217e+03  -0.275136217 7.163146e+02
## Store39            -7.290667e-02 1.770372e-02  -4.118156643 7.456156e-02
## Store40            -4.902298e-01 7.841628e-01  -0.625163271 4.247493e+00
## Store41            -1.983603e-01 2.191421e-01  -0.905167211 1.171451e+00
## Store42            -5.771796e+02 2.098210e+03  -0.275081826 1.057891e+03
## Store43            -5.758362e+02 2.098189e+03  -0.274444388 7.744298e+02
## Store44            -5.778830e+02 2.098208e+03  -0.275417394 1.099685e+03
## Store45             1.006451e+03 1.466482e+03   0.686302719 5.237233e+02
## IsHolidayTRUE       3.186190e-02 3.506886e-02   0.908552341 5.858405e-03
## Temperature         1.636358e-03 1.545515e-03   1.058778776 1.596248e-03
## Year                7.150664e-01 4.842609e-01   1.476613974          NaN
## Month2              1.780026e-01 7.499032e-02   2.373673923 5.219311e-02
## Month3              1.953422e-01 1.013864e-01   1.926710332 6.959458e-02
## Month4              2.633347e-01 1.372981e-01   1.917978682 8.198018e-02
## Month5              3.262366e-01 1.750388e-01   1.863795982 8.506892e-02
## Month6              3.925996e-01 2.116653e-01   1.854813078 2.387191e-02
## Month7              4.177294e-01 2.511205e-01   1.663461933 9.296993e-02
## Month8              5.000908e-01 2.905928e-01   1.720933083 9.211143e-02
## Month9              5.006216e-01 3.275758e-01   1.528262001 8.568678e-02
## Month10             5.862588e-01 3.657519e-01   1.602886820 1.278198e-01
## Month11             7.613896e-01 4.056344e-01   1.877034100 1.697661e-01
## Month12             9.161452e-01 4.467503e-01   2.050687517 1.310361e-01
## Fuel_Price         -1.558916e-02 4.108731e-02  -0.379415384 8.942240e-02
## CPI                 1.140325e-05 9.749408e-03   0.001169635 5.220979e-02
## Unemployment       -1.419603e-02 2.676413e-02  -0.530412521 1.450155e-01
## CPI:TypeB           1.119169e-02 1.538619e-02   0.727385316 1.481311e-01
## CPI:TypeC          -1.434467e-02 2.038767e-02  -0.703595288 2.446826e-01
## Unemployment:TypeB -3.029170e-04 4.734142e-02  -0.006398561 4.403203e-01
## Unemployment:TypeC -4.297132e-02 6.071150e-02  -0.707795348 5.396582e-01
## Year:TypeB         -5.030728e-01 7.296162e-01  -0.689503371 8.669763e-02
## Year:TypeC          2.882768e-01 1.043922e+00   0.276147803 6.075234e-01
## Month2:TypeB       -3.043922e-02 1.126613e-01  -0.270183458 1.144372e-01
## Month3:TypeB       -6.274382e-02 1.500908e-01  -0.418038987 1.405028e-01
## Month4:TypeB       -1.143091e-01 2.019003e-01  -0.566165807 1.425199e-01
## Month5:TypeB       -1.629676e-01 2.578299e-01  -0.632074227 1.258207e-01
## Month6:TypeB       -1.613557e-01 3.119952e-01  -0.517173603 7.679049e-02
## Month7:TypeB       -2.108634e-01 3.718702e-01  -0.567035026 9.554537e-02
## Month8:TypeB       -2.657217e-01 4.333481e-01  -0.613183047 1.489865e-01
## Month9:TypeB       -3.174844e-01 4.924484e-01  -0.644705942 1.467204e-01
## Month10:TypeB      -3.607307e-01 5.519925e-01  -0.653506578 2.396537e-01
## Month11:TypeB      -3.589223e-01 6.128549e-01  -0.585656301 1.228570e-01
## Month12:TypeB      -3.503463e-01 6.739612e-01  -0.519831508 1.454221e-01
## Month2:TypeC       -8.353039e-02 1.613635e-01  -0.517653551 2.396666e-01
## Month3:TypeC       -3.162013e-02 2.149176e-01  -0.147126748 3.153127e-01
## Month4:TypeC       -2.219818e-02 2.891465e-01  -0.076771422 3.139010e-01
## Month5:TypeC        4.781777e-03 3.689539e-01   0.012960364 2.994857e-01
## Month6:TypeC       -1.558334e-02 4.463288e-01  -0.034914480 2.964931e-01
## Month7:TypeC        1.460933e-02 5.320568e-01   0.027458224 2.581955e-01
## Month8:TypeC        2.747509e-02 6.199953e-01   0.044314990 2.881344e-01
## Month9:TypeC        1.094247e-01 7.044866e-01   0.155325509 2.657990e-01
## Month10:TypeC       1.232995e-01 7.898484e-01   0.156105218 3.735162e-01
## Month11:TypeC       5.528615e-02 8.767803e-01   0.063055876 3.734813e-01
## Month12:TypeC      -3.179916e-02 9.642127e-01  -0.032979407 3.284708e-01
## TypeA:Day_To_Begin -3.138619e-03 1.328615e-03  -2.362324613 6.713548e-04
## TypeB:Day_To_Begin  1.930695e-03 1.508267e-03   1.280075421 1.458908e-03
## TypeC:Day_To_Begin -5.954641e-03 2.541451e-03  -2.343008052 2.429951e-03
##                         Robust z
## (Intercept)        -7.555490e+00
## Store2              1.038092e+01
## Store3              2.090258e+00
## Store4              6.249739e-02
## Store5              1.899035e+00
## Store6             -6.588709e-02
## Store7              1.941477e+00
## Store8             -1.977737e+00
## Store9              1.704866e+00
## Store10             2.832890e+00
## Store11            -7.898211e-01
## Store12             2.078843e+00
## Store13             5.797191e-02
## Store14             1.833671e-01
## Store15             1.968966e+00
## Store16             1.909148e+00
## Store17             2.011015e+00
## Store18             2.228727e+00
## Store19            -1.068608e-02
## Store20             8.608118e-01
## Store21             2.405266e+00
## Store22             1.981384e+00
## Store23             2.474404e+00
## Store24            -2.434713e-02
## Store25             3.630641e+00
## Store26            -9.568985e-02
## Store27             3.881471e-02
## Store28            -1.733704e-02
## Store29             2.362389e+00
## Store30            -4.907465e-01
## Store31            -5.289409e+00
## Store32            -2.209289e-01
## Store33            -3.889899e-01
## Store34            -9.383854e-02
## Store35             1.994508e+00
## Store36            -1.874856e+01
## Store37            -6.243262e-01
## Store38            -8.059244e-01
## Store39            -9.778051e-01
## Store40            -1.154163e-01
## Store41            -1.693286e-01
## Store42            -5.455943e-01
## Store43            -7.435616e-01
## Store44            -5.254988e-01
## Store45             1.921722e+00
## IsHolidayTRUE       5.438664e+00
## Temperature         1.025128e+00
## Year                         NaN
## Month2              3.410461e+00
## Month3              2.806859e+00
## Month4              3.212176e+00
## Month5              3.834969e+00
## Month6              1.644609e+01
## Month7              4.493167e+00
## Month8              5.429194e+00
## Month9              5.842460e+00
## Month10             4.586604e+00
## Month11             4.484935e+00
## Month12             6.991548e+00
## Fuel_Price         -1.743317e-01
## CPI                 2.184121e-04
## Unemployment       -9.789323e-02
## CPI:TypeB           7.555260e-02
## CPI:TypeC          -5.862564e-02
## Unemployment:TypeB -6.879468e-04
## Unemployment:TypeC -7.962692e-02
## Year:TypeB         -5.802613e+00
## Year:TypeC          4.745115e-01
## Month2:TypeB       -2.659906e-01
## Month3:TypeB       -4.465664e-01
## Month4:TypeB       -8.020571e-01
## Month5:TypeB       -1.295237e+00
## Month6:TypeB       -2.101245e+00
## Month7:TypeB       -2.206945e+00
## Month8:TypeB       -1.783529e+00
## Month9:TypeB       -2.163874e+00
## Month10:TypeB      -1.505217e+00
## Month11:TypeB      -2.921463e+00
## Month12:TypeB      -2.409168e+00
## Month2:TypeC       -3.485274e-01
## Month3:TypeC       -1.002818e-01
## Month4:TypeC       -7.071715e-02
## Month5:TypeC        1.596663e-02
## Month6:TypeC       -5.255885e-02
## Month7:TypeC        5.658245e-02
## Month8:TypeC        9.535511e-02
## Month9:TypeC        4.116823e-01
## Month10:TypeC       3.301048e-01
## Month11:TypeC       1.480292e-01
## Month12:TypeC      -9.680972e-02
## TypeA:Day_To_Begin -4.675053e+00
## TypeB:Day_To_Begin  1.323384e+00
## TypeC:Day_To_Begin -2.450519e+00

We will interpret the model result together with GLM model introduced below.

Generalized linear model Using Quasi-Likelihood

Our last model is generalized linear model using quasi-likelihood. First we observe that

c(mean(data_analysis$Weekly_Sales), var(data_analysis$Weekly_Sales))
## [1]     15981.26 515797856.84

The variance is nearly the square of the mean. Therefore we use quasi-likelihood in which variance is \(\mu^2\), and take log-link to downsize the sales data. The covariates are those selected using stepwise regression. The covariates and the link function are selected so that the model does not have a very large or very small residual deviance, making the model more convincing.

data.fit2 <- data_analysis
# make response positive
data.fit2$Weekly_Sales <- data.fit2$Weekly_Sales + abs(min(data.fit2$Weekly_Sales)) + 2
data.fit2$Month <- factor(data.fit2$Month)
# GLM using quasi likelihood
glm.model <- glm(data=data.fit2, Weekly_Sales ~ Store + Dept + IsHoliday + Temperature + Year + Month + Fuel_Price + CPI + Unemployment
      + Type:CPI + Type:Unemployment
      + Type:Year + Type:Month + Type:Day_To_Begin,
      family = quasi(link = "log", variance = "mu^2"))
summary(glm.model)
## 
## Call:
## glm(formula = Weekly_Sales ~ Store + Dept + IsHoliday + Temperature + 
##     Year + Month + Fuel_Price + CPI + Unemployment + Type:CPI + 
##     Type:Unemployment + Type:Year + Type:Month + Type:Day_To_Begin, 
##     family = quasi(link = "log", variance = "mu^2"), data = data.fit2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8943  -0.2345  -0.0422   0.1615   4.0817  
## 
## Coefficients:
##                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        -5.270e+02  7.032e+01   -7.495 6.63e-14 ***
## Store2              1.725e-01  5.407e-03   31.905  < 2e-16 ***
## Store3              6.286e+02  1.067e+02    5.893 3.79e-09 ***
## Store4              3.985e-01  6.164e-02    6.465 1.01e-10 ***
## Store5              6.286e+02  1.067e+02    5.892 3.81e-09 ***
## Store6              4.717e-02  5.885e-03    8.015 1.11e-15 ***
## Store7              6.290e+02  1.067e+02    5.897 3.71e-09 ***
## Store8             -3.228e-01  6.778e-03  -47.627  < 2e-16 ***
## Store9              6.288e+02  1.067e+02    5.894 3.76e-09 ***
## Store10             6.301e+02  1.067e+02    5.907 3.48e-09 ***
## Store11            -4.549e-02  6.050e-03   -7.518 5.58e-14 ***
## Store12             6.298e+02  1.067e+02    5.904 3.56e-09 ***
## Store13             3.908e-01  6.178e-02    6.326 2.51e-10 ***
## Store14             2.942e-01  2.193e-02   13.412  < 2e-16 ***
## Store15             6.294e+02  1.067e+02    5.900 3.63e-09 ***
## Store16             6.289e+02  1.067e+02    5.896 3.73e-09 ***
## Store17             6.296e+02  1.067e+02    5.902 3.59e-09 ***
## Store18             6.297e+02  1.067e+02    5.903 3.57e-09 ***
## Store19             1.804e-01  5.737e-02    3.144 0.001664 ** 
## Store20             2.700e-01  7.419e-03   36.396  < 2e-16 ***
## Store21             6.290e+02  1.067e+02    5.897 3.71e-09 ***
## Store22             6.296e+02  1.067e+02    5.902 3.58e-09 ***
## Store23             6.299e+02  1.067e+02    5.905 3.53e-09 ***
## Store24             1.276e-01  5.747e-02    2.220 0.026421 *  
## Store25             6.290e+02  1.067e+02    5.896 3.72e-09 ***
## Store26            -9.290e-02  5.745e-02   -1.617 0.105885    
## Store27             3.053e-01  5.460e-02    5.591 2.26e-08 ***
## Store28             1.986e-01  6.369e-02    3.118 0.001823 ** 
## Store29             6.294e+02  1.067e+02    5.900 3.65e-09 ***
## Store30            -3.446e+02  1.724e+02   -1.999 0.045639 *  
## Store31            -1.100e-01  5.422e-03  -20.289  < 2e-16 ***
## Store32            -1.072e-01  1.707e-02   -6.280 3.39e-10 ***
## Store33            -7.749e-01  6.204e-02  -12.489  < 2e-16 ***
## Store34            -8.280e-02  6.252e-02   -1.324 0.185407    
## Store35             6.296e+02  1.067e+02    5.902 3.58e-09 ***
## Store36            -8.158e-01  6.329e-03 -128.900  < 2e-16 ***
## Store37            -3.445e+02  1.724e+02   -1.998 0.045692 *  
## Store38            -3.449e+02  1.724e+02   -2.001 0.045442 *  
## Store39            -3.054e-02  5.562e-03   -5.491 3.99e-08 ***
## Store40            -1.408e-01  5.706e-02   -2.468 0.013601 *  
## Store41            -6.734e-02  1.677e-02   -4.016 5.92e-05 ***
## Store42            -3.448e+02  1.724e+02   -2.000 0.045477 *  
## Store43            -3.443e+02  1.724e+02   -1.997 0.045783 *  
## Store44            -3.451e+02  1.724e+02   -2.002 0.045272 *  
## Store45             6.292e+02  1.067e+02    5.898 3.68e-09 ***
## Dept2               6.606e-01  6.814e-03   96.946  < 2e-16 ***
## Dept3              -3.771e-01  6.814e-03  -55.337  < 2e-16 ***
## Dept4               2.456e-01  6.814e-03   36.037  < 2e-16 ***
## Dept5               2.063e-02  6.838e-03    3.017 0.002555 ** 
## Dept6              -9.057e-01  6.943e-03 -130.462  < 2e-16 ***
## Dept7               9.726e-02  6.814e-03   14.274  < 2e-16 ***
## Dept8               3.489e-01  6.814e-03   51.207  < 2e-16 ***
## Dept9              -3.132e-03  6.836e-03   -0.458 0.646860    
## Dept10             -7.885e-02  6.814e-03  -11.573  < 2e-16 ***
## Dept11             -2.298e-01  6.814e-03  -33.728  < 2e-16 ***
## Dept12             -9.175e-01  6.821e-03 -134.516  < 2e-16 ***
## Dept13              3.755e-01  6.814e-03   55.114  < 2e-16 ***
## Dept14             -2.358e-01  6.814e-03  -34.610  < 2e-16 ***
## Dept16             -2.421e-01  6.814e-03  -35.533  < 2e-16 ***
## Dept17             -4.858e-01  6.820e-03  -71.228  < 2e-16 ***
## Dept18             -7.281e-01  7.278e-03 -100.043  < 2e-16 ***
## Dept19             -1.331e+00  7.724e-03 -172.325  < 2e-16 ***
## Dept20             -8.361e-01  6.928e-03 -120.689  < 2e-16 ***
## Dept21             -8.509e-01  6.814e-03 -124.885  < 2e-16 ***
## Dept22             -4.913e-01  7.087e-03  -69.323  < 2e-16 ***
## Dept23              1.033e-01  7.009e-03   14.734  < 2e-16 ***
## Dept24             -8.101e-01  7.136e-03 -113.522  < 2e-16 ***
## Dept25             -5.591e-01  6.836e-03  -81.789  < 2e-16 ***
## Dept26             -6.880e-01  7.021e-03  -97.992  < 2e-16 ***
## Dept27             -1.294e+00  7.061e-03 -183.207  < 2e-16 ***
## Dept28             -1.379e+00  6.902e-03 -199.822  < 2e-16 ***
## Dept29             -8.845e-01  7.173e-03 -123.311  < 2e-16 ***
## Dept30             -1.027e+00  7.180e-03 -142.981  < 2e-16 ***
## Dept31             -1.158e+00  6.918e-03 -167.385  < 2e-16 ***
## Dept32             -7.425e-01  6.952e-03 -106.808  < 2e-16 ***
## Dept33             -8.037e-01  7.142e-03 -112.521  < 2e-16 ***
## Dept34             -2.405e-01  7.166e-03  -33.564  < 2e-16 ***
## Dept35             -1.154e+00  7.178e-03 -160.816  < 2e-16 ***
## Dept36             -1.265e+00  7.178e-03 -176.219  < 2e-16 ***
## Dept37             -1.258e+00  9.040e-03 -139.192  < 2e-16 ***
## Dept38              1.088e+00  6.814e-03  159.679  < 2e-16 ***
## Dept39             -1.799e+00  9.679e-02  -18.589  < 2e-16 ***
## Dept40              7.114e-01  6.814e-03  104.412  < 2e-16 ***
## Dept41             -1.264e+00  7.141e-03 -177.054  < 2e-16 ***
## Dept42             -8.323e-01  6.823e-03 -121.984  < 2e-16 ***
## Dept43             -1.395e+00  1.117e-01  -12.488  < 2e-16 ***
## Dept44             -9.481e-01  7.113e-03 -133.301  < 2e-16 ***
## Dept45             -1.611e+00  1.003e-02 -160.537  < 2e-16 ***
## Dept46              5.690e-03  6.814e-03    0.835 0.403708    
## Dept47             -1.634e+00  1.596e-02 -102.392  < 2e-16 ***
## Dept48             -1.470e+00  1.047e-02 -140.319  < 2e-16 ***
## Dept49             -7.310e-01  7.528e-03  -97.107  < 2e-16 ***
## Dept50             -1.346e+00  1.095e-02 -122.895  < 2e-16 ***
## Dept51             -1.577e+00  1.144e-02 -137.798  < 2e-16 ***
## Dept52             -1.188e+00  6.860e-03 -173.207  < 2e-16 ***
## Dept54             -1.580e+00  7.393e-03 -213.641  < 2e-16 ***
## Dept55             -4.892e-01  7.069e-03  -69.203  < 2e-16 ***
## Dept56             -9.811e-01  6.977e-03 -140.610  < 2e-16 ***
## Dept58             -1.143e+00  7.576e-03 -150.813  < 2e-16 ***
## Dept59             -1.371e+00  6.906e-03 -198.554  < 2e-16 ***
## Dept60             -1.364e+00  7.008e-03 -194.582  < 2e-16 ***
## Dept65              8.545e-01  3.290e-02   25.972  < 2e-16 ***
## Dept67             -6.558e-01  6.814e-03  -96.244  < 2e-16 ***
## Dept71             -8.794e-01  7.173e-03 -122.602  < 2e-16 ***
## Dept72              7.408e-01  6.925e-03  106.981  < 2e-16 ***
## Dept74             -2.879e-01  6.815e-03  -42.239  < 2e-16 ***
## Dept77             -1.618e+00  3.195e-02  -50.648  < 2e-16 ***
## Dept78             -1.660e+00  2.568e-02  -64.623  < 2e-16 ***
## Dept79              9.902e-02  6.814e-03   14.532  < 2e-16 ***
## Dept80             -3.061e-01  6.971e-03  -43.907  < 2e-16 ***
## Dept81             -1.339e-01  6.814e-03  -19.645  < 2e-16 ***
## Dept82             -1.706e-01  6.814e-03  -25.042  < 2e-16 ***
## Dept83             -1.002e+00  6.966e-03 -143.890  < 2e-16 ***
## Dept85             -1.158e+00  6.922e-03 -167.349  < 2e-16 ***
## Dept87             -2.987e-01  6.826e-03  -43.761  < 2e-16 ***
## Dept90              7.435e-01  6.814e-03  109.122  < 2e-16 ***
## Dept91              4.600e-01  6.814e-03   67.506  < 2e-16 ***
## Dept92              1.189e+00  6.814e-03  174.530  < 2e-16 ***
## Dept93              2.887e-01  6.965e-03   41.447  < 2e-16 ***
## Dept94              4.987e-01  7.036e-03   70.875  < 2e-16 ***
## Dept95              1.128e+00  6.814e-03  165.516  < 2e-16 ***
## Dept96             -9.015e-02  7.355e-03  -12.257  < 2e-16 ***
## Dept97             -2.057e-01  6.857e-03  -29.999  < 2e-16 ***
## Dept98             -6.694e-01  6.988e-03  -95.792  < 2e-16 ***
## Dept99             -1.641e+00  1.405e-02 -116.772  < 2e-16 ***
## IsHolidayTRUE       2.305e-02  2.578e-03    8.939  < 2e-16 ***
## Temperature         7.992e-04  1.136e-04    7.037 1.96e-12 ***
## Year                2.671e-01  3.498e-02    7.635 2.26e-14 ***
## Month2              1.263e-01  5.409e-03   23.344  < 2e-16 ***
## Month3              1.181e-01  7.322e-03   16.125  < 2e-16 ***
## Month4              1.511e-01  9.924e-03   15.225  < 2e-16 ***
## Month5              1.818e-01  1.265e-02   14.367  < 2e-16 ***
## Month6              2.091e-01  1.531e-02   13.658  < 2e-16 ***
## Month7              2.007e-01  1.816e-02   11.050  < 2e-16 ***
## Month8              2.434e-01  2.101e-02   11.585  < 2e-16 ***
## Month9              2.121e-01  2.367e-02    8.961  < 2e-16 ***
## Month10             2.623e-01  2.642e-02    9.928  < 2e-16 ***
## Month11             3.816e-01  2.930e-02   13.021  < 2e-16 ***
## Month12             5.310e-01  3.227e-02   16.453  < 2e-16 ***
## Fuel_Price         -6.997e-03  3.014e-03   -2.321 0.020261 *  
## CPI                 2.198e-03  7.059e-04    3.114 0.001849 ** 
## Unemployment       -1.539e-02  1.922e-03   -8.009 1.16e-15 ***
## CPI:TypeB           3.785e-03  1.122e-03    3.373 0.000744 ***
## CPI:TypeC          -7.009e-03  1.671e-03   -4.194 2.74e-05 ***
## Unemployment:TypeB  1.253e-03  3.422e-03    0.366 0.714216    
## Unemployment:TypeC -1.257e-02  5.004e-03   -2.513 0.011986 *  
## Year:TypeB         -3.135e-01  5.307e-02   -5.907 3.49e-09 ***
## Year:TypeC          1.719e-01  8.577e-02    2.004 0.045080 *  
## Month2:TypeB       -4.327e-02  8.166e-03   -5.298 1.17e-07 ***
## Month3:TypeB       -5.232e-02  1.090e-02   -4.801 1.58e-06 ***
## Month4:TypeB       -8.009e-02  1.467e-02   -5.458 4.83e-08 ***
## Month5:TypeB       -1.100e-01  1.874e-02   -5.869 4.38e-09 ***
## Month6:TypeB       -1.160e-01  2.269e-02   -5.112 3.20e-07 ***
## Month7:TypeB       -1.467e-01  2.705e-02   -5.422 5.88e-08 ***
## Month8:TypeB       -1.812e-01  3.152e-02   -5.747 9.07e-09 ***
## Month9:TypeB       -2.038e-01  3.582e-02   -5.690 1.27e-08 ***
## Month10:TypeB      -2.373e-01  4.015e-02   -5.911 3.41e-09 ***
## Month11:TypeB      -2.563e-01  4.458e-02   -5.750 8.94e-09 ***
## Month12:TypeB      -2.746e-01  4.902e-02   -5.602 2.12e-08 ***
## Month2:TypeC       -8.632e-02  1.330e-02   -6.490 8.62e-11 ***
## Month3:TypeC       -5.225e-02  1.769e-02   -2.955 0.003131 ** 
## Month4:TypeC       -4.983e-02  2.378e-02   -2.096 0.036106 *  
## Month5:TypeC       -3.986e-02  3.033e-02   -1.314 0.188851    
## Month6:TypeC       -4.753e-02  3.670e-02   -1.295 0.195353    
## Month7:TypeC       -2.020e-02  4.374e-02   -0.462 0.644244    
## Month8:TypeC       -2.281e-02  5.096e-02   -0.448 0.654421    
## Month9:TypeC        4.874e-02  5.791e-02    0.842 0.400006    
## Month10:TypeC       4.766e-02  6.492e-02    0.734 0.462872    
## Month11:TypeC      -2.041e-02  7.207e-02   -0.283 0.776994    
## Month12:TypeC      -1.325e-01  7.924e-02   -1.673 0.094384 .  
## TypeA:Day_To_Begin -7.734e-04  9.599e-05   -8.057 7.82e-16 ***
## TypeB:Day_To_Begin  1.999e-05  1.100e-04    0.182 0.855829    
## TypeC:Day_To_Begin -1.209e-03  2.151e-04   -5.620 1.91e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasi family taken to be 0.1493835)
## 
##     Null deviance: 320600  on 421569  degrees of freedom
## Residual deviance:  61414  on 421397  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 7

The definition of store types A, B, C is not mentioned from the data source, but we infer them to be Walmart Supercenter, Walmart Neighbourhood Market and Walmart Express stores respectively considering their size, amount and the description of Walmart store types before 2016.

Model interpretation:

  • The estimated coefficient for holiday indicator is 2.305e-02, with a p-value less than 2e-16, which suggests that holidays like Thanksgiving and Christmas indeed promote the sales and the holiday effect is statically significant. Same result is also observed in GEE model.

  • The estimated coefficient for temperature is 7.992e-04, with a p-value 1.96e-12, which indicates that temperature correlates with sales positively. When the temperature goes down, people may not want to go out and shop in retail stores since it is too cold. Same result is also observed in GEE model.

  • The estimated coefficient for fuel price is -6.997e-03, with a p-value 0.02, which says that upgoing fuel price also reduces sales, and the effect is statistically significant. Our proposed reason is that when fuel price goes up, people may not want to drive to retail stores to go shopping. Same result is also observed in GEE model.

  • The estimated coefficient for CPI is 2.198e-03, with a p-value 0.001, which suggests that CPI is also positively correlated with sales and this makes perfect sense. This result is supported by the data with statistical significance. Same result is also observed in GEE model.

  • The estimated coefficient for the unemployment rate is -1.539e-02, with a p-value 1.16e-15, which means that the unemployment rate is negatively correlated with sales. This result has statistical significance. The reason for such a phenomenon is that when the unemployment rate goes up, people may not have enough money to shop in retail stores. Same result is also observed in GEE model.

  • The estimated coefficient for Year is 2.671e-01, with a p-value 2.26e-14, which indicates that weekly sales is increasing over these years. The result is also statistically significant. Same result is also observed in GEE model.

  • The estimated coefficient for month indicator has the largest value in November and December, which makes perfect sense since Thanksgiving and Christmas holidays are in these two months and the sales have a outburst during these holidays. Same result is also observed in GEE model.

  • Comparing with type A store (supercenter), holding other covariates the same, when CPI goes up, all stores’ weekly sales will go up. However, type B stores’ (neighborhood market) weekly sales have larger increase speed while type C stores’ (Express) weekly sales decrease more slowly. Same result is also observed in GEE model.

  • Comparing with type A store (supercenter), holding other covariates the same, when unemployment rate goes up, all stores’ weekly sales will go down. However, type B stores’ (neighborhood market) weekly sales decrease faster while type C stores’ weekly sales have a less decreasing speed. While in GEE model, type B (neighborhood market) and type C (Express) stores’ weekly sales both have a less decreasing speed compared with type A stores.

  • Comparing with type A store (supercenter), holding other covariates the same, as time goes, all stores’ weekly sales will go up. However, type B stores’ (neighborhood market) weekly sales have a slower increasing speed while type C (Express) stores’ weekly sales have a faster uprising speed. Same result is also observed in GEE model.

To sum up, we divide the covariates into “helpers” and “suppressors”:
Positively correlated with weekly sales (helpers): Is Holiday indicator, CPI, Time, Temperature
Negatively correlated with weekly sales (suppressors): Unemployment rate, fuel price

Potential limitations: The residual deviance is 62,400 on 421,433 degrees of freedom. Even though there may be a problem of overfitting, but we use quasi-likelihood to make a control and the model is convincing. What’s more, we only consider linear interactions but there may be some nonlinear interactions between covariates.

Advice for sales manager in Walmart:

  • Holiday effect is statistically siginificant. Therefore, stores can host some sales promotion activities on national holidays to increase their profit.

  • It helps to make more promotion activities in the first quarter to attract customers.

Predictive algorithm

We first partitioned the data into training and testing set. In application, we are more interested in predicting the weekly sales in a certain period. As a result, we introduced p as the percentage of training data, and used the first p percent of the days as the training data and the rest as test data.

## partition into train and test data
p <- 0.8
data_train <- data_analysis[which(data_analysis$Day_To_Begin <= quantile(data_analysis$Day_To_Begin, p)),]
data_test <- data_analysis[which(data_analysis$Day_To_Begin > quantile(data_analysis$Day_To_Begin, p)),]

We chose to try parallel computing, which had been covered in our course lectures, to reduce the time consumption of our machine learning methods.

## parallel computing
registerDoMC(4)

Given that we have discussed Tree-related classification methods intensively in our Advanced Data Science course on several sample questions, we are curious about the preformance of Tree-related machine learning methods in answering regression (rather than classification) questions.

Our exploration of Tree-related regression methods began with the Classification and Regression Tree (CART) algorithm. This method is the most primitive machine learning method applying decision trees. In the caret package, we used method = rpart1SE to realize this method.

The four major reasons of using CART:

  • The CART model is simple that could be explained by a tree structure.

  • The time usage of CART model is very low even on large samples, which means that the method is convenient.

  • We have learned the structure of CART in our courses and we hope to find their application in real questions.

  • The CART model serves as the benchmark of all the three Tree-related machine learning methods we practiced. We would expect that more elaborated machine learning methods would give a better prediction performance than the CART model.

Since the random forest (RF) algorithm that we were going to implement later was very slow and took a very long time to generate a prediction result, we used the first 50000 observations in our training dataset for all the machine learning methods to make it fair in performance evaluation.

## train the model using proposed machine learning methods
### xgbTree
#### training
ptm <- proc.time()
fit.CART <- train(Weekly_Sales ~ ., data = data_train[1:50000,], method = "rpart1SE", 
               trControl = trainControl("cv", number = 10))
proc.time() - ptm
#### prediction
pred.CART <- predict(fit.CART, newdata = data_test)

# save(pred.CART, file = "pred.CART.rda")

The second machine learning method we propose is the Gradient Boosting Decision Tree (GBDT) algorithm. This method is built on the fundament of Decision Trees. There are three major reasons that we choose this model:

  • We are familiar with the mathematical background of Gradient Boosting Trees so we feel safe to use it.

  • This model is ensembled in the caret package thus there are no additional difficulties in constructing the model in R.

  • This model is fairly efficient that the time for running this machine learning code is acceptable given that we use parallel computing technique.

## train the model using proposed machine learning methods
### xgbTree
#### training
set.seed(123)
ptm <- proc.time()
fit.xgbTree <- train(Weekly_Sales ~ ., data = data_train[1:50000,], method = "xgbTree", 
               trControl = trainControl("cv", number = 10))
proc.time() - ptm
#### prediction
pred.xgbTree <- predict(fit.xgbTree, newdata = data_test)

# save(pred.xgbTree, file = "pred.xgbTree.rda")

Finally, we would compare the results of Gradient Boosting Trees with Random Forest (RF) algorithm. There are two reasons for us to choose Random Forest algorithm:

  • Random Forest is also based on Decision Trees as well as GBDT, and we believe that RF is more comparable with GBDT than other methods (e.g., Neural Network) not using a similar framework.

  • The Random Forest method is also easy to implement as we could use the rf method embedded in the caret package.

### random forest
#### training
ptm <- proc.time()
fit.rf <- train(Weekly_Sales ~ ., data = data_train[1:50000,], method = "rf",
               trControl = trainControl("cv", number = 10))
proc.time() - ptm
#### prediction
pred.rf <- predict(fit.rf, newdata = data_test)

# save(pred.rf, file = "pred.rf.rda")

Comparison of prediction models

With the prediction result of each machine learning models, we first had a look at the predicted sales in comparison with real sales in the test data.

We have provided some visualizations to evaluate our machine learning models.

First, we loaded the results of machine learning prediction to avoid wasting time on running machine learning codes.

load(file = "./Intermediate/fit.CART.rda")
load(file = "./Intermediate/fit.xgbTree.rda")
load(file = "./Intermediate/fit.rf.rda")

Second, we arranged the results into a form that is ready for exploratory analysis. We want to compare the machine learning results with the GLM model that we fit in the previous section.

data_train_glm <- data_train
data_train_glm$Weekly_Sales <- data_train_glm$Weekly_Sales + abs(min(data_train_glm$Weekly_Sales)) + 2


glm.model <- glm(data = data_train_glm, Weekly_Sales ~ Dept + IsHoliday + Temperature + Year + Month + Fuel_Price + CPI + Unemployment
      + Type:CPI + Type:Unemployment
      + Type:Year + Type:Month + Type:Day_To_Begin,
      family = quasi(link = "log", variance = "mu^2"))

data_test <- data_test[data_test$Store %in% c(1,2,3,4,5,6),]

pred.glm <- predict(glm.model, newdata = data_test, type = "response") - abs(min(data_train_glm$Weekly_Sales)) - 2
pred.CART <- predict(fit.CART, newdata = data_test)
pred.xgbTree <- predict(fit.xgbTree, newdata = data_test)
## [23:29:58] WARNING: amalgamation/../src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
data_test1 <- data_test
data_test1$Day_To_Begin <- data_test1$Date - as.Date("2010-01-01")
pred.rf <- predict(fit.rf, newdata = data_test1)

ws.true <- data.frame(Weekly_Sales = data_test$Weekly_Sales, 
                      Date = data_test$Date, 
                      model = "True Value")
ws.glm <- data.frame(Weekly_Sales = pred.glm,
                      Date = data_test$Date,
                      model = "glm")
ws.CART <- data.frame(Weekly_Sales = pred.CART,
                      Date = data_test$Date,
                      model = "CART")
ws.xgbTree <- data.frame(Weekly_Sales = pred.xgbTree, 
                      Date = data_test$Date, 
                      model = "xgbTree")
ws.rf <- data.frame(Weekly_Sales = pred.rf, 
                      Date = data_test$Date, 
                      model = "RF")


ws.all <- rbind(ws.true, ws.glm, ws.CART, ws.xgbTree, ws.rf)
ws.all$Observation_ID <- rep((seq(1,11516)), 5)

We have produced two plots to compare the model-based predictions and the true value that we have in our dataset.

## predicted sales from different models in comparison with real sales
ggplot(ws.all[ws.all$model %in% c("True Value", "glm"),]) +
  geom_line(aes(x = Observation_ID, y = Weekly_Sales, col = model)) +
  labs(title = "Sales prediction performance of GLM model", y = "Weekly Sales", x = "Observation ID") +
  ylim(c(0, 200000))

Comp_GLM <- cbind(ws.true[,1], ws.glm[,1]) %>% as.data.frame()

ggplot(Comp_GLM) +
  geom_point(aes(x = V1, y = V2)) +
  geom_abline(slope = 1, intercept = 0, col = "red", size = 1.5, alpha = 0.75) +
  labs(title = "Reality vs GLM model prediction", y = "GLM Prediction", x = "Reality", subtitle = "The red line is the reference line") +
  xlim(c(0, 200000)) +
  ylim(c(0, 200000))

First, we compare the reality versus a CART model prediction.

## predicted sales from different models in comparison with real sales
ggplot(ws.all[ws.all$model %in% c("True Value", "CART"),]) +
  geom_line(aes(x = Observation_ID, y = Weekly_Sales, col = model)) +
  labs(title = "Sales prediction performance of CART model", y = "Weekly Sales", x = "Observation ID") +
  ylim(c(0, 200000))

Comp_CART <- cbind(ws.true[,1], ws.CART[,1]) %>% as.data.frame()

ggplot(Comp_CART) +
  geom_point(aes(x = V1, y = V2)) +
  geom_abline(slope = 1, intercept = 0, col = "red", size = 1.5, alpha = 0.75) +
  labs(title = "Reality vs CART model prediction", y = "CART Prediction", x = "Reality", subtitle = "The red line is the reference line") +
  xlim(c(0, 200000)) +
  ylim(c(0, 200000))

Since the structure of CART is simple, we only had ~15 different levels of values for different values of predictors. For a prediction with better quality, we should consider more elaborate model based on this CART model.

Second, we compared the reality with GBDT prediction.

## predicted sales from different models in comparison with real sales
ggplot(ws.all[ws.all$model %in% c("True Value", "xgbTree"),]) +
  geom_line(aes(x = Observation_ID, y = Weekly_Sales, col = model)) +
  labs(title = "Sales prediction performance of GBDT model", y = "Weekly Sales", x = "Observation ID") +
  ylim(c(0, 200000))

Comp_xgbTree <- cbind(ws.true[,1], ws.xgbTree[,1]) %>% as.data.frame()

ggplot(Comp_xgbTree) +
  geom_point(aes(x = V1, y = V2)) +
  geom_abline(slope = 1, intercept = 0, col = "red", size = 1.5, alpha = 0.75) +
  labs(title = "Reality vs GBDT model prediction", y = "GBDT Prediction", x = "Reality", subtitle = "The red line is the reference line") +
  xlim(c(0, 200000)) +
  ylim(c(0, 200000))

We could see the diagonality on the point plot given above, which means that the GBDT algorithm grabs more information from the predictors in giving a prediction. However, we could see some outliers when the real sales value is ~25000 but the prediction value is ~130000, which is not good.

Finally, we compared the reality with Random Forest prediction.

## predicted sales from different models in comparison with real sales
ggplot(ws.all[ws.all$model %in% c("True Value", "RF"),]) +
  geom_line(aes(x = Observation_ID, y = Weekly_Sales, col = model)) +
  labs(title = "Sales prediction performance of Random Forest model", y = "Weekly Sales", x = "Observation ID") +
  ylim(c(0, 200000))

Comp_rf <- cbind(ws.true[,1], ws.rf[,1]) %>% as.data.frame()

ggplot(Comp_rf) +
  geom_point(aes(x = V1, y = V2)) +
  geom_abline(slope = 1, intercept = 0, col = "red", size = 1.5, alpha = 0.75) +
  labs(title = "Reality vs RF model prediction", y = "RF Prediction", x = "Reality", subtitle = "The red line is the reference line") +
  xlim(c(0, 200000)) +
  ylim(c(0, 200000))

We could also see the diagonality on the point plot given above, which means that the RF algorithm also grabs more information from the predictors in giving a prediction. The outliers in the Reality vs GBDT plot disappeared in the Reality vs RF plot, which indicates that RF algorithm may have a better performance than GBDT.

We used the square root of mean squared deviance as a score to evaluate the fitting performance.

\[\sqrt{\frac{1}{n}\Sigma^{n}_{i = 1}(y_i - \hat{y}_i)^2}\]

SCORE <- function(data_test, vector_pre){
  temp <- (data_test$Weekly_Sales - vector_pre) ^ 2
  score <- sum(temp)/length(vector_pre)
  return(sqrt(score))
}

The performance score for the CART, Gradient Boost Regression Tree (GBRT), and Random Forest (RF) fitting methods were calculated as below:

SCORE(data_test, pred.glm)
## [1] 14427.48
SCORE(data_test, pred.CART)
## [1] 12482.18
SCORE(data_test, pred.xgbTree)
## [1] 6068.619
SCORE(data_test, pred.rf)
## [1] 3765.605

The results correspond to our exploratory results that the performance of Random Forest is the best among others, and the performance of Gradient Boosted Decision Tree is better than the benchmark CART.

We used the following code to evaluate the patterns of time usage for different machine learning methods, namely CART, GBDT and RF. We used only one core to run a small-sized sample to make it fair for different methods.

We set two Small-size groups to compare the elapsed time for different methods. The first group used 100 to 1000 observations for each model. The second group used 1000 to 5000 observations for each model. The interval in the first group is 100, and the interval in the second group is 1000.

Ans_CART <- matrix(rep(NA, 30), nrow = 10)

for (i in 1:10){
  ptm <- proc.time()
  fit.CART <- train(Weekly_Sales ~ ., data = data_train[1:(100 * i),], method = "rpart1SE",
               trControl = trainControl("cv", number = 10))
  A <- proc.time() - ptm
  Ans_CART[i,] <- A[1:3] %>% as.numeric()
}

colnames(Ans_CART) <- c("User", "System", "Elapsed")
save(Ans_CART, file = "Ans_CART.rda")
Ans_5k_CART <- matrix(rep(NA, 15), nrow = 5)

for (i in 1:5){
  ptm <- proc.time()
  fit.CART <- train(Weekly_Sales ~ ., data = data_train[1:(1000 * i),], method = "rpart1SE",
               trControl = trainControl("cv", number = 10))
  A <- proc.time() - ptm
  Ans_5k_CART[i,] <- A[1:3] %>% as.numeric()
}

colnames(Ans_5k_CART) <- c("User", "System", "Elapsed")
save(Ans_5k_CART, file = "Ans_5k_CART.rda")
Ans_xgbTree <- matrix(rep(NA, 30), nrow = 10)

for (i in 1:10){
  ptm <- proc.time()
  fit.xgbTree <- train(Weekly_Sales ~ ., data = data_train[1:(100 * i),], method = "xgbTree",
               trControl = trainControl("cv", number = 10))
  A <- proc.time() - ptm
  Ans_xgbTree[i,] <- A[1:3] %>% as.numeric()
}

colnames(Ans_xgbTree) <- c("User", "System", "Elapsed")
save(Ans_xgbTree, file = "Ans_xgbTree.rda")
Ans_5k_xgbTree <- matrix(rep(NA, 15), nrow = 5)

for (i in 1:5){
  ptm <- proc.time()
  fit.xgbTree <- train(Weekly_Sales ~ ., data = data_train[1:(1000 * i),], method = "xgbTree",
               trControl = trainControl("cv", number = 10))
  A <- proc.time() - ptm
  Ans_5k_xgbTree[i,] <- A[1:3] %>% as.numeric()
}

colnames(Ans_5k_xgbTree) <- c("User", "System", "Elapsed")
save(Ans_5k_xgbTree, file = "Ans_5k_xgbTree.rda")
Ans_RF <- matrix(rep(NA, 30), nrow = 10)

for (i in 1:10){
  ptm <- proc.time()
  fit.rf <- train(Weekly_Sales ~ ., data = data_train[1:(100 * i),], method = "rf",
               trControl = trainControl("cv", number = 10))
  A <- proc.time() - ptm
  Ans_RF[i,] <- A[1:3] %>% as.numeric()
}

colnames(Ans_RF) <- c("User", "System", "Elapsed")
save(Ans_RF, file = "Ans_RF.rda")
Ans_5k_RF <- matrix(rep(NA, 15), nrow = 5)

for (i in 1:5){
  ptm <- proc.time()
  fit.rf <- train(Weekly_Sales ~ ., data = data_train[1:(1000 * i),], method = "rf",
               trControl = trainControl("cv", number = 10))
  A <- proc.time() - ptm
  Ans_5k_RF[i,] <- A[1:3] %>% as.numeric()
}

colnames(Ans_5k_RF) <- c("User", "System", "Elapsed")
save(Ans_5k_RF, file = "Ans_5k_RF.rda")

We used the following visualization to show the time usage pattern of different machine learning models.

load(file = "./Intermediate/Ans_CART.rda")
load(file = "./Intermediate/Ans_5k_CART.rda")
load(file = "./Intermediate/Ans_xgbTree.rda")
load(file = "./Intermediate/Ans_5k_xgbTree.rda")
load(file = "./Intermediate/Ans_RF.rda")
load(file = "./Intermediate/Ans_5k_RF.rda")
time.CART <- data.frame(time = Ans_CART[,3],
                      model = "CART")
time.rf <- data.frame(time = Ans_RF[,3], 
                      model = "RF")
time.xgbTree <- data.frame(time = Ans_xgbTree[,3], 
                      model = "xgbTree")

time.all <- rbind(time.CART, time.rf, time.xgbTree) %>% as.data.frame()
num <- rep(seq(100, 1000, by = 100), 3)
time.all$number <- num

## predicted sales from different models in comparison with real sales
plot_1 <- ggplot(time.all) +
  geom_line(aes(x = number, y = time, col = model), size = 1.5) +
  labs(x = "Observation Capacity", y = "Elapsed Time", title = "Time Usage on Small Sample Group I")
time.CART.5k <- data.frame(time = Ans_5k_CART[,3],
                      model = "CART")
time.rf.5k <- data.frame(time = Ans_5k_RF[,3], 
                      model = "RF")
time.xgbTree.5k <- data.frame(time = Ans_5k_xgbTree[,3], 
                      model = "xgbTree")

time.all.5k <- rbind(time.CART.5k, time.rf.5k, time.xgbTree.5k) %>% as.data.frame()
num <- rep(seq(1000, 5000, by = 1000), 3)
time.all.5k$number <- num

## predicted sales from different models in comparison with real sales
plot_2 <- ggplot(time.all.5k) +
  geom_line(aes(x = number, y = time, col = model), size = 1.5) +
  labs(title = "Time Usage on Small Sample Group II", x = "Observation Capacity", y = "Elapsed Time")
plot_combine <- ggarrange(plot_1, plot_2, ncol = 2, widths = 12, heights = 5)
plot_combine

The speed of the primitive CART algorithm is very fast in the first small sample size group.

On the first small sample group with sample size from 100 to 1000, the speed of RF is faster than Gradient Boosted Decision Tree. However, the elapsed time of RF shows an exponential pattern, which indicates that RF could be very slow when the sample size is big.

For the second group with sample size from 1000 to 5000, the speed of the primitive CART algorithm is still very fast.

On a sample size from 1000 to 5000, The elapsed time of RF maintains an exponential pattern. However, the elapsed time of GBDT does not increase greatly. When the sample size is 5000, the elapsed time of RF is already around ten times as much as the elapsed time of GBDT.

When the sample size is even larger on ~300000 level, the Random Forest algorithm will cost an extremely long period to generate an answer.

Advice for sales manager in Walmart:

  • For best performance, Random Forest would be our recommended way to predict future sales since it gives the best fitting score.

  • The time usage of Random Forest is very high given the data is large, so the sales managers should be careful in implementing this method.

  • Since Gradient Boosted Decision Tree gives acceptable prediction performance with much better efficiency than Random Forest, the sales manager should take this method into consideration.

More work regarding to Online sales and Offline sales

In the previous sections, we only studied a representative of offline retail – Walmart offline retail. This section, we aim to study the characteristics of online retail and offline retail, which can give more insightful advice for companies having both service online and offline.

We used the retail sales data (which have been described in the “data” section) to get the estimated quarterly retail sales. We assumed the E-commerce retail (\(EC\)) sales in the retail sales dataset can represent the online sales, and the non-E-commerce (\(NEC= Total-ECommerce\)) can represent the offline sales. To control for the change of GDP over time, we also used the GDP data.

We did a preliminary study from two perspectives. First, the tolerance of online and offline sales to financial impact of the market. Second, the quartly trend of online and offline sales.

Data Tidying and wrangling

Because the 1999 and 2019 data are not completed, we removed the data for these two years. After this, there are \(4*19*2\) retail sales data for 4 quaters, 19 years, Total/E-commerce retail sales in the US.

ECom <- read_excel("./Data/tsnotadjustedsales.xls")

colnames(ECom) <- c("Quarter", "Sales_TOT", "Sales_EC", "EC_percent", 
                    "Per.Change.Prior.Qua_TOT", "Per.Change.Prior.Qua_EC", 
                    "Per.Change.Prior.Year_TOT", "Per.Change.Prior.Year_EC")
ECom <- ECom[c(-1:-8, -89:-100),]

## remove 2019 & 1999 data
ECom <- ECom[c(-1:-3, -80),]

ECom$Sales_TOT <- as.integer(ECom$Sales_TOT)
ECom$Sales_EC <- as.integer(ECom$Sales_EC)

We converted the data into online and offline sales, i.e. \(online=EC\) and \(offline = Total-EC\). Then we identified the Year and Quarter anad put them into separate columns. Finally, the quarterly GDP is merged into the table. The cleaned format data table was saved into “ECom1”.

## Transform the data to cleaned form with variable -- Year, Quarter, Sales_NEC, Sales_EC
ECom1 <- data.frame(Year = as.integer(unlist(lapply(str_split(ECom$Quarter, " quarter "), 
                                                    FUN = function(x){x[2]}))),
                    Quarter = unlist(lapply(str_split(ECom$Quarter, " quarter "), 
                                            FUN = function(x){x[1]})),
                    Sales_NEC = (ECom$Sales_TOT - ECom$Sales_EC), 
                    Sales_EC = (ECom$Sales_EC), stringsAsFactors = F)

## merge GDP data into our table
GDP <- read_csv("./Data/GDP1.csv")
## Parsed with column specification:
## cols(
##   DATE = col_date(format = ""),
##   GDP = col_double()
## )
tmp <- str_split(GDP$DATE, "-")
GDP$GDP <- 1000*GDP$GDP
GDP$Year <- as.integer(unlist(lapply(tmp, FUN = function(x){x[1]})))
GDP$Quarter <- unlist(lapply(tmp, FUN = function(x){
  a <- x[2]
  if(a=="01"){return("1st")}
  if(a=="04"){return("2nd")}
  if(a=="07"){return("3rd")}
  if(a=="10"){return("4th")}
}))
ECom1 <- inner_join(ECom1, GDP[,-1], by = c("Year", "Quarter"))

Exploratory data analysis

Below is the plot of the temporal trend of GDP, offline and online retail sales.

## temporal trend of GDP 
tmp <- rbind(data.frame(quart.num = dim(ECom1)[1]:1, 
                        Sales = ECom1$GDP/1000000,
                        Type = rep("GDP", dim(ECom1)[1])))
tmp$quart.num <- factor(tmp$quart.num, level = 1:dim(ECom1)[1])
p1 <- ggplot(tmp, aes(x = quart.num, y = Sales, group = Type), color = "orange") + 
  geom_line() + geom_point() +
  xlab("") + ylab("GDP \n(10e+6 Billions \nof Dollars)") +
  scale_x_discrete(breaks = as.character(seq(from = 1, to = 76, by = 8)), 
                     labels = as.character(seq(from = 2000, to = 2018, by = 2)))

## temporal trend of offline 
tmp <- rbind(data.frame(quart.num = dim(ECom1)[1]:1, 
                  Sales = ECom1$Sales_NEC/100000,
                  Type = rep("Offline", dim(ECom1)[1])))
tmp$quart.num <- factor(tmp$quart.num,level = 1:dim(ECom1)[1])
p2 <- ggplot(tmp, aes(x = quart.num, y = Sales, group = Type), color = "red") + 
  geom_line() + geom_point() +
  xlab("") + ylab("Offline sales \n(10e+5 Millions \nof Dollars)") +
  scale_x_discrete(breaks = as.character(seq(from = 1, to = 76, by = 8)), 
                     labels = as.character(seq(from = 2000, to = 2018, by = 2)))

## temporal trend of online 
tmp <- data.frame(quart.num = dim(ECom1)[1]:1, 
                        Sales = ECom1$Sales_EC/10000,
                        Type = rep("Online", dim(ECom1)[1]))
tmp$quart.num <- factor(tmp$quart.num,level = 1:dim(ECom1)[1])
p3 <- ggplot(tmp, aes(x = quart.num, y = Sales, group = Type), color = "blue") + 
  geom_line() + geom_point() +
  xlab("Year") + ylab("Online sales \n(10e+4 Millions \nof Dollars)") +
  scale_x_discrete(breaks = as.character(seq(from = 1, to = 76, by = 8)), 
                     labels = as.character(seq(from = 2000, to = 2018, by = 2)))

grid.arrange(p1, p2, p3, nrow = 3)

We can see the impact from the financial crisis in 2008 obviously from the plot.

The overall temporal trend for offline retail sales are similar to GDP – a same decrease in 2008, and overall incrase with year. But the difference is that the GDP is smoothly increase over quarters within a year, while the offline has quarterly fluctuations. In contrast to GDP and offline retail sales, the online retail sales did not be affected much due to the financial crisis and it remained stable over years, which is surprising.

Both online and offline retail sales have quarterly fluctuations. But the patterns are different – online retail sales is very high in the fourth quarter compared with the other three quarters, and almost in a stable level in the first three quaters; while offline retail sales is lower in the 1st quarter, and a little higher in the 4th quarter.

Statistical analysis

Based on the findings in EDA, we have two aims in this statistical analysis part

  • Did financial crisis in 2018 significantly affect both online and offline retail sales?

  • Do the patterns of quarterly fluctuations of online and offline retail sales are the same?

Did financial crisis in 2018 significantly affect both online and offline retail sales?

In the prior section (exploratory data analysis), we found that GDP is smoothly increase, except for a decrease around 2008, the time of financial crisis. In the GDP sales curve, the GDP begin the drop from 3rd quarter of 2018 and went back to the same level until 3rd quarter of 2010. Similar to the GDP drop around 2018, the offline retail sales curve also has a drop during that time. But the influence of financial crisis to the online retail sales seems not so significant.

In this part, we want to test whether the drop of GDP and offline retail sales during that time is statistically significant; and whether there is also significant drop for online retail sales.

Therefore, we assume the three curves are consist of three linear components –

  1. before 3rd quarter of 2018

  2. from 3rd quarter of 2018 to 3rd quarter of 2010

  3. after 3rd quarter of 2010

Under this assumption, we model them with three linear splines, breaking at 3rd quarter of 2018 and 3rd quarter of 2010. We aim to test whether the second spline is 0.

## make a temporary table with linear splines
## test whether the second spline (modelled for the financial crisis) is significant for GDP

tmp <- rbind(data.frame(quart.num = dim(ECom1)[1]:1, 
                  Sales = ECom1$GDP,
                  Type = rep("GDP", dim(ECom1)[1])))
tmp1 = tmp %>% 
  mutate(spline2 = ifelse(quart.num > 35, quart.num - 35, 0)) %>%
  mutate(spline3 = ifelse(quart.num > 43, quart.num - 43, 0)) %>% 
  mutate(spline1 = quart.num)

summary(fit1 <- lm(Sales ~ spline1 + spline2 + spline3, data = tmp1))
## 
## Call:
## lm(formula = Sales ~ spline1 + spline2 + spline3, data = tmp1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -479275 -178594   21893  144924  358520 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9589198      72522  132.22  < 2e-16 ***
## spline1       149981       3346   44.83  < 2e-16 ***
## spline2      -152116      14407  -10.56 2.82e-16 ***
## spline3       177025      14528   12.19  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 214900 on 72 degrees of freedom
## Multiple R-squared:  0.9952, Adjusted R-squared:  0.995 
## F-statistic:  4958 on 3 and 72 DF,  p-value: < 2.2e-16

For the GDP curve. The results coefficient for spline 2 is significant not zero, and the estimate is \(-152,116\) with p value of (\(2.82e-16\)), which means there is a significant drop of the GDP during that period.

## make a temporary table with linear splines
## test whether the second spline (modelled for the financial crisis) is significant for offline

tmp <- rbind(data.frame(quart.num = dim(ECom1)[1]:1, 
                  Sales = ECom1$Sales_NEC,
                  Type = rep("offline", dim(ECom1)[1])))
tmp1 = tmp %>% 
  mutate(spline2 = ifelse(quart.num > 35, quart.num - 35, 0)) %>%
  mutate(spline3 = ifelse(quart.num > 43, quart.num - 43, 0)) %>% 
  mutate(spline1 = quart.num)

summary(fit2 <- lm(Sales ~ spline1 + spline2 + spline3, data = tmp1))
## 
## Call:
## lm(formula = Sales ~ spline1 + spline2 + spline3, data = tmp1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -159966  -30503    8082   37842   74115 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 711918.4    16653.6  42.749  < 2e-16 ***
## spline1       7257.3      768.3   9.446  3.1e-14 ***
## spline2     -12113.2     3308.3  -3.661 0.000475 ***
## spline3      13492.4     3336.2   4.044 0.000130 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49350 on 72 degrees of freedom
## Multiple R-squared:  0.8829, Adjusted R-squared:  0.878 
## F-statistic:   181 on 3 and 72 DF,  p-value: < 2.2e-16

For the offline retail sales curve. The results coefficient for spline 2 is significant not zero, and the estimate is \(-12,113.2\) with p value of (\(4.75e-4\)), which means there is a significant drop of the offline retail sales during that period.

## make a temporary table with linear splines
## test whether the second spline (modelled for the financial crisis) is significant for online

tmp <- rbind(data.frame(quart.num = dim(ECom1)[1]:1, 
                  Sales = ECom1$Sales_EC,
                  Type = rep("online", dim(ECom1)[1])))
tmp1 = tmp %>% 
  mutate(spline2 = ifelse(quart.num > 35, quart.num - 35, 0)) %>%
  mutate(spline3 = ifelse(quart.num > 43, quart.num - 43, 0)) %>% 
  mutate(spline1 = quart.num)

summary(fit3 <- lm(Sales ~ spline1 + spline2 + spline3, data = tmp1))
## 
## Call:
## lm(formula = Sales ~ spline1 + spline2 + spline3, data = tmp1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11881  -3448  -1259   2043  31144 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1843.4     2755.9   0.669   0.5057    
## spline1        977.7      127.1   7.690 5.81e-11 ***
## spline2       -913.5      547.5  -1.669   0.0996 .  
## spline3       2759.2      552.1   4.998 3.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8166 on 72 degrees of freedom
## Multiple R-squared:  0.9546, Adjusted R-squared:  0.9527 
## F-statistic: 504.2 on 3 and 72 DF,  p-value: < 2.2e-16

For the online retail sales curve. The results coefficient for spline 2 is significant not zero, and the estimate is \(-913.5\) with p value of (\(0.10\)). Though the regression coefficient is negative, the test is not statistically significant. Therefore, the online retail sales did not significantly drop during that period.

We finally plotted the regression lines, which gives us more straight-forward way of the impact of financial crisis for them.

## Plots of regression line for GDP
tmp.GDP <- rbind(data.frame(quart.num = dim(ECom1)[1]:1, 
                        Sales = ECom1$GDP/1000000,
                        Type = rep("GDP", dim(ECom1)[1])))
tmp.GDP$quart.num <- factor(tmp.GDP$quart.num, level = 1:dim(ECom1)[1])
p1 <- ggplot(tmp.GDP, aes(x = quart.num, y = Sales, group = Type)) + 
  geom_line() +
  geom_point() +
  xlab("") + ylab("GDP \n(10e+6 Billions \nof Dollars)") +
  scale_x_discrete(breaks = as.character(seq(from = 1, to = 76, by = 8)), 
                     labels = as.character(seq(from = 2000, to = 2018, by = 2))) +
  geom_line(aes(x = quart.num, y = fit1$fitted.values/1000000), color = "red")

## Plots of regression line for offline
tmp.offline <- rbind(data.frame(quart.num = dim(ECom1)[1]:1, 
                  Sales = ECom1$Sales_NEC/100000,
                  Type = rep("Offline", dim(ECom1)[1])))
tmp.offline$quart.num <- factor(tmp.offline$quart.num,level = 1:dim(ECom1)[1])
p2 <- ggplot(tmp.offline, aes(x = quart.num, y = Sales, group = Type)) + 
  geom_line()+ 
  geom_point() +
  xlab("") + ylab("Offline sales \n(10e+5 Millions \nof Dollars)") +
  scale_x_discrete(breaks = as.character(seq(from = 1, to = 76, by = 8)), 
                     labels = as.character(seq(from = 2000, to = 2018, by = 2))) +
  geom_line(aes(x = quart.num, y = fit2$fitted.values/100000), color = "red")

## Plots of regression line for online
tmp.online <- data.frame(quart.num = dim(ECom1)[1]:1, 
                        Sales = ECom1$Sales_EC/10000,
                        Type = rep("Online", dim(ECom1)[1]))
tmp.online$quart.num <- factor(tmp.online$quart.num,level = 1:dim(ECom1)[1])
p3 <- ggplot(tmp.online, aes(x = quart.num, y = Sales, group = Type)) + 
  geom_line() +
  geom_point() +
  xlab("Year") + ylab("Online sales \n(10e+4 Millions \nof Dollars)") +
  scale_x_discrete(breaks = as.character(seq(from = 1, to = 76, by = 8)), 
                     labels = as.character(seq(from = 2000, to = 2018, by = 2))) +
  geom_line(aes(x = quart.num, y = fit3$fitted.values/10000), color = "red")

## plot
grid.arrange(p1, p2, p3, nrow = 3)

# jpeg(filename = "./ecom_year.jpeg", width = 1500, height = 1000, res = 180)
# grid.arrange(p1, p2, p3, nrow = 3)
# dev.off()

By this analysis, we find that the online retail sales is more tolerant for financial impact, such as the financial crisis in 2008. This can inspire the sales manager that if a potential financial crisis is coming, they can put more goods and personnels to the online market, and develop effective strategies for online sales, which may compensate for the great impact of offline sales due to the financial crisis.

But our analysis has some limitations: First, we only have data for one financial crisis which happened around 2008, which limits the power and reliability of our analysis. Second, the offline retails develped much better than the online retails at the year 2008, so more people would purchase products from offline stores at that time. This may lead to a more dramatic drop of the offline sales due to some financial impact to the market at that time. So it is hard for us to predict for the consequence of future financial crisis when the offline retail and online retail have both well developed.

Do the quarterly (seasonal) patterns of online and offline retail sales are same?

This can provide a clue for Walmart (and other companies which have both online and offline stores) on how to make marketing policies regarding to the distribution of online and offline products on certain time period.

In the EDA, we found some patterns of the quarterly fluctuations of online and offline retail sales

  • For online, sales is extremely high in the 4th quarter; but somehow low for 1st, 2nd, and 3rd quarters.

  • For offline, sales is obvious lower in the 1st quarter, and a little higher in the 4th quarter; while the sales in 2nd and 3rd quarters are about in the median level.

To control for the yearly increase of online and offline sales, we used the percentage of quarterly sales within a year to study the pattern of quarterly fluctuations. So first, we reforamteed the data and used new variables in this part of study –

  1. For online, online retail sales percentage in the \(i^{th}\) quarter in \(j^{th}\) year = online sales in \(i^{th}\) quarter in \(j^{th}\) year / total online sales in \(j^{th}\) year.

  2. For offline, offline retail sales percentage in the \(i^{th}\) quarter in \(j^{th}\) year = offline sales in \(i^{th}\) quarter in \(j^{th}\) year / total offline sales in \(j^{th}\) year.

The new variables are saved as “Scaled_NEC” and “Scaled_EC”.

## transform the offline and online retail sales into the percentage of sales in a year 

Scaled_NEC <- numeric()
Scaled_EC <- numeric()
year <- unique(ECom1$Year)
for (y in year) {
  tmp <- ECom1[ECom1$Year == y,3]
  Scaled_NEC <- c(Scaled_NEC, tmp/sum(tmp)*100)
  tmp <- ECom1[ECom1$Year == y,4]
  Scaled_EC <- c(Scaled_EC, tmp/sum(tmp)*100)
}
ECom1 <- data.frame(ECom1, Scaled_NEC = Scaled_NEC, Scaled_EC = Scaled_EC)

Using the new variables, we can see obvious quarterly trend exist for the online and offline retail sales in the picture below.

## plot of the quarterly trend of online and offline retail sales

tmp <- rbind(data.frame(quart.num = rep(4:1, dim(ECom1)[1]/4), 
                        Sales = ECom1$Scaled_NEC,
                        Type = rep("Offline", dim(ECom1)[1])), 
             data.frame(quart.num = rep(4:1, dim(ECom1)[1]/4), 
                        Sales = ECom1$Scaled_EC,
                        Type = rep("Online", dim(ECom1)[1])))

ggplot(tmp, aes(x = quart.num, y = Sales, group = Type, color = Type)) + 
  geom_jitter(alpha = 0.5, width = 0.3) + 
  ggtitle("Quarterly trend of sales") +
  xlab("Quarter") + ylab("Percentage of sales in a year (%)")

First we assumed that the online and offline retail sales both have a linear quarterly trend. So we used linear model to fit the data. The coefficients in the model represent the increase rate of the percentage of sales. We wanted to test whether the increase rates of the percentage of sales are same for the two retail type, which allows us to know the whether the two retail types have same quartly trend.

In the regression results below, the coefficients are statistically significant (all have p-values much less than 0.05). Below, we added the two regression lines to the plot which indicate the quarterly increase rate are different for the two retail types.

summary(fit1 <- lm(Sales ~ quart.num * Type, data = tmp))
## 
## Call:
## lm(formula = Sales ~ quart.num * Type, data = tmp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8425 -0.6486 -0.1624  0.9173  3.6989 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           22.6412     0.4198  53.938  < 2e-16 ***
## quart.num              0.9435     0.1533   6.156 6.68e-09 ***
## TypeOnline            -4.7696     0.5936  -8.034 2.72e-13 ***
## quart.num:TypeOnline   1.9078     0.2168   8.801 3.27e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.494 on 148 degrees of freedom
## Multiple R-squared:  0.7218, Adjusted R-squared:  0.7161 
## F-statistic:   128 on 3 and 148 DF,  p-value: < 2.2e-16
## plot with regression lines
ggplot(tmp, aes(x = quart.num, y = Sales, group = Type, color = Type)) + 
  geom_jitter(alpha = 0.5, width = 0.3) + 
  ggtitle("Quarterly trend of sales") +
  xlab("Quarter") + ylab("Percentage of sales in a year (%)") +
  geom_line(aes(x = quart.num, y = fit1$fitted.values, color = Type), lwd=1)

To statistically check whether the difference is significant, we did F test, to test for the interaction term of retail type and increase rate. Under the null hypothesis (H0: there is no interaction between retail type and increase rate), F-statistics follow \(F_{148, 149}\) in our model. As the result shown below, the p-value for the F test is \(3.27e-15\), which means the increase rate for the two retail types are significantly different.

## F test for the interaction term
fit2 <- lm(Sales ~ quart.num + Type, data = tmp)
anova(fit1, fit2)
## Analysis of Variance Table
## 
## Model 1: Sales ~ quart.num * Type
## Model 2: Sales ~ quart.num + Type
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1    148 330.32                                 
## 2    149 503.21 -1   -172.89 77.463 3.27e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Further, we checked the residual plot of our analysis. As the residual plot shown below, it’s obviously that the residuals are not normal distributed. So the assumption of our model (linear increase rate) may not be true.

## residual plot
qplot(y = fit1$residuals, x = jitter(quart.num), color = Type, data = tmp, xlab = "Quarter", ylab = "Residuals")

To better model the quartly trend for the two retail type, we no longer assumed that the online and offline retail sales both have a linear quarterly trend. Instead, we used natural splines to fit the data. We wanted to test whether the natural splines for the two retail types are same, which allows us to know the whether the two retail types have same quartly trend.

In the regression results below, the coefficients are statistically significant (all have p-values much less than 0.05). Below, we added the two lines to the plot which indicate the quarterly trends are different for the two retail types.

## adding natural spline term

summary(fit1 <- lm(Sales ~ ns(quart.num, df = 3) * Type, data = tmp))
## 
## Call:
## lm(formula = Sales ~ ns(quart.num, df = 3) * Type, data = tmp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3726 -0.1948 -0.0132  0.1923  2.1212 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        23.1217     0.1298 178.162  < 2e-16 ***
## ns(quart.num, df = 3)1              0.7169     0.2618   2.738 0.006961 ** 
## ns(quart.num, df = 3)2              5.1690     0.3211  16.098  < 2e-16 ***
## ns(quart.num, df = 3)3              1.5274     0.1776   8.602 1.21e-14 ***
## TypeOnline                         -1.0395     0.1835  -5.664 7.77e-08 ***
## ns(quart.num, df = 3)1:TypeOnline  -1.2949     0.3703  -3.497 0.000626 ***
## ns(quart.num, df = 3)2:TypeOnline   2.0051     0.4541   4.416 1.96e-05 ***
## ns(quart.num, df = 3)3:TypeOnline   6.9903     0.2511  27.839  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5657 on 144 degrees of freedom
## Multiple R-squared:  0.9612, Adjusted R-squared:  0.9593 
## F-statistic: 509.4 on 7 and 144 DF,  p-value: < 2.2e-16
## plots with splines

grid <-  seq(from=1, to = 4, by=0.01)
nD <- data.frame(quart.num = rep(grid, 2),
           Type = c(rep("Offline", length(grid)), rep("Online", length(grid))))
nD$predict <- predict(fit1, nD)
p <- ggplot(tmp, aes(x = quart.num, y = Sales, group = Type, color = Type)) + 
  geom_jitter(alpha = 0.5, width = 0.3) + 
  ggtitle("Quarterly trend of sales") +
  xlab("Quarter") + ylab("Percentage of sales in a year (%)") +
  geom_line(data = nD, aes(x = quart.num, y = predict, color = Type), lwd=1)
# jpeg(filename = "./ecom_quarter.jpeg", width = 1100, height = 1000, res = 160)
# print(p)
# dev.off()
p

To statistically check whether the difference is significant, we did F test, to test for the interaction term of retail type and natural splines of quarters. Under the null hypothesis (H0: there is no interaction between them), F-statistics follow \(F_{144, 147}\) in our model. As the result shown below, the p-value for the F test is \(2.2e-15\), which means the quarterly trend of the two retail types are significantly different.

## F-test for the interaction of spline term and type of sales

fit2 <- lm(Sales ~ ns(quart.num, df = 3) + Type, data = tmp)
anova(fit1,fit2)
## Analysis of Variance Table
## 
## Model 1: Sales ~ ns(quart.num, df = 3) * Type
## Model 2: Sales ~ ns(quart.num, df = 3) + Type
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    144  46.08                                  
## 2    147 372.50 -3   -326.42 340.01 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Further, we checked the residual plot. As the residual plot shown below, the residuals are approximately normal distributed, which gives us confidence for our results.

## Residual plots

qplot(y=fit1$residuals, x = jitter(quart.num), color = Type, data = tmp, xlab = "Quarter", ylab = "Residuals")

Summary and Recommendations

Summary

In this data analysis report, we used weekly retail sales data from Kaggle website. We aimed at finding influential factors that affect retail sales and giving some advice to sales managers. We used different models to analyze the covariates and different machine learning techniques to help sales managers better predict sales. Based on this, we can make some suggestions to sales managers on how to implement promotions to make bigger profit.

In the exploratory data analysis, we first plotted the heatmap of the correlation of different covariates. We found that there are almost no correlations between covariates, which is good for our model fitting and prediction methods. Then we conducted analysis on weekly sales’ temporal trend and relationship between different covariates and weekly sales stratified by type of store. We found several interesting interactions between type and covariates.

In the model fitting part, we first fitted linear model using all covariates and interactions between type of the stores and these covariates. However, when we deleted the outlier and conducted model checking, we found that the residual is not normally distributed. Therefore we turned to GEE model and GLM model using quasi-likelihood. These two models gave very similar results and the interpretation coincides with our observation in the exploratory data analysis, which give us confidence in our results. We found that:

  • We divide the covariates into “helpers” and “suppressors”:
    Positively correlated with weekly sales (helpers): Is Holiday indicator, CPI, Time, Temperature
    Negatively correlated with weekly sales (suppressors): Unemployment rate, fuel price

  • We further explored the type-specific effect and found some very interesting interactions.

In the prediction analysis part, we used three different Tree-based machine learning regression algorithms inspired by the Tree-based classification methods taught in our class. We have learned the application of these algorithms and their time usage pattern from this data analysis project. The key findings are:

  • CART algorithm runs very fast for large sample size but the fitting is relatively poor, though slightly better than the classical GLM method.

  • Random Forest gives the best performance among all the machine learning methods we have used. However, the time consumption of Random Forest with large sample size could be a major restriction of its application in industry.

  • Since the implementation of Random Forest takes too much time, we would recommend the GBDT (Gradient Boosted Decision Tree) algorithm with both acceptable predicting performance and time usage.

Last but not the least, we conducted some studies about the comparison between online and offline retail sales. We used quarterly US retail sales data from 2000-2018 to do the analysis. We modelel the data with splines, and use T and F test to test the significance of those terms. By our analysis, we have observed and used statistical tests to justified:

  • There is significant drop of GDP and offline retail sales during the notable financial crisis period around 2008. But online retail sales remain steay during that period.

  • The seasonal patterns of online and offline sales display different patterns. The percentage of online sales is extremely high.

Limitations

  • In the glm model, The residual deviance is 62,400 on 421,433 degrees of freedom. Even though there may be a problem of overfitting, but we use quasi-likelihood to make a control and the model is convincing. What’s more, we only consider linear interactions but there may be some nonlinear interactions between covariates.

  • Our research has shown that the elapsed time for the Random Forest algorithm is roughly exponential to the sample size. When the sample size is big with around 300000 observations, the time consumption of Random Forest is way beyond the level of acceptance. Our Strategy to handle this problem is to use partial data for prediction.

  • We only have data for one financial crisis which happened around 2008. This limits the power and reliability of our analysis for the drop of offline and online sales. In addition, the offline retails develped much better than the online retails at the year 2008, so more people would purchase products from offline stores at that time. This may lead to a more dramatic drop of the offline sales due to some financial impact to the market. So it is hard for us to predict for the consequence of future financial crisis when the offline retails and online retails have both well developed.

Recommendations

  • Holiday effect is statistically siginificant. Therefore, stores can host some sales promotion activities on national holidays to increase their profit.

  • It helps to make more promotion activities in the first quarter to attract customers.

  • For best performance, Random Forest would be our recommended way to predict future sales since it gives the best fitting score.

  • The time usage of Random Forest is very high given the data is large, so the sales managers should be careful in implementing this method.

  • Since Gradient Boosted Decision Tree gives acceptable prediction performance with much better efficiency than Random Forest, the sales manager should take this method into consideration.

  • When facing a financial crisis, or financial decrease, more resources can be moved to the online department.

  • In the fourth quarter, there is great increase of online sales, which may need more resources and personnels, such as online agents.

Future work

  • We may need to consider more nonlinear interaction term in our fitted model. More complicated statistical models, such as generalized linear mixed models considering department as random effect factors, can also be implemented.

  • In the future, we could consider additional predictors, such as the exact location of each stores, and include them in the model for a better sales record prediction.